69 research outputs found

    Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space

    We present a framework for discriminative sequence classification in which the learner works directly in the high-dimensional predictor space of all subsequences in the training set. This is made possible by a new coordinate-descent algorithm coupled with a bound on the magnitude of the gradient, which allows discriminative subsequences to be selected quickly. We characterize the loss functions to which our generic learning algorithm can be applied and present concrete implementations for logistic regression (binomial log-likelihood loss) and support vector machines (squared hinge loss). Applying our algorithm to protein remote homology detection and remote fold recognition yields performance comparable to that of state-of-the-art methods (e.g., kernel support vector machines). Unlike state-of-the-art classifiers, the resulting classification models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem.

    Generating Concise and Readable Summaries of XML Documents

    XML has become the de facto standard for data representation and exchange, resulting in large-scale repositories and warehouses of XML data. For users to understand and explore these large collections, a summarized, bird's-eye view of the available data is a necessity. In this paper, we are interested in semantic XML document summaries, which present the "important" information available in an XML document to the user. In the best case, such a summary is a concise replacement for the original document itself. At the other extreme, it should at least help the user make an informed choice as to the relevance of the document to their needs. We address the two main issues that arise in producing such meaningful and concise summaries: i) which tags or text units are important and should be included in the summary, and ii) how to generate summaries of different sizes. We conduct user studies with different real-life datasets and show that our methods are useful and effective in practice.

    Statistical learning techniques for text categorization with sparse labeled data

    Many applications involve learning a supervised classifier from very few explicitly labeled training examples, since the cost of manually labeling the training data is often prohibitively high. For instance, we expect a good classifier to learn our interests from a few example books or movies we like and to recommend similar ones in the future, or we expect a search engine to give more personalized search results based on the little it has learned from our past queries and clicked documents. There is thus a need for classification techniques capable of learning from sparse labeled data, by exploiting additional information about the classification task at hand (e.g., background knowledge) or by employing more sophisticated features (e.g., n-gram sequences, trees, graphs). In this thesis, we focus on two approaches for overcoming the bottleneck of sparse labeled data. We first propose the Inductive/Transductive Latent Model (ILM/TLM), which is a new generative model for text documents. ILM/TLM has various building blocks designed to facilitate the integration of background knowledge (e.g., unlabeled documents, ontologies of concepts, encyclopedias) into the process of learning from small training data. Our method can be used for inductive and transductive learning and achieves significant gains over state-of-the-art methods for very small training sets. Second, we propose Structured Logistic Regression (SLR), which is a new coordinate-wise gradient ascent technique for learning logistic regression in the space of all (word or character) sequences in the training data. SLR exploits the inherent structure of the n-gram feature space in order to automatically provide a compact set of highly discriminative n-gram features. Our detailed experimental study shows that while SLR achieves classification results similar to those of the state-of-the-art methods (which use all n-gram features given explicitly), it is more than an order of magnitude faster than its competitors. The techniques presented in this thesis can be used to advance the technologies for automatically and efficiently building large training sets, thereby reducing the need for spending human computation on this task.

    Fast logistic regression for text categorization with variable-length n-grams

    A common representation used in text categorization is the bag-of-words model (a.k.a. the unigram model). Learning with this representation typically involves some preprocessing, e.g., stopword removal and stemming. This results in one explicit tokenization of the corpus. In this work, we introduce a logistic regression approach in which learning involves automatic tokenization. This allows us to weaken the a priori required knowledge about the corpus and results in a tokenization with variable-length (word or character) n-grams as basic tokens. We accomplish this by solving logistic regression using gradient ascent in the space of all n-grams. We show that this can be done very efficiently using a branch-and-bound approach which chooses the maximum gradient ascent direction projected onto a single dimension (i.e., candidate feature). Although the space is very large, our method allows us to investigate variable-length n-gram learning. We demonstrate the efficiency of our approach compared to state-of-the-art classifiers used for text categorization, such as cyclic coordinate descent logistic regression and support vector machines.
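
The coordinate-wise gradient ascent idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: it enumerates every character n-gram by brute force instead of using the branch-and-bound search the abstract describes, and the function names, the fixed step size `lr`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def char_ngrams(text, max_n):
    """Return the set of all character n-grams of length 1..max_n in text."""
    return {text[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)}

def score(weights, text, max_n=3):
    """Sum of the weights of the selected n-grams present in text."""
    return sum(weights.get(g, 0.0) for g in char_ngrams(text, max_n))

def train_greedy_logreg(docs, labels, max_n=3, steps=30, lr=0.5):
    """Greedy coordinate-wise gradient ascent on the logistic log-likelihood.

    labels are +1 / -1; features are binary n-gram presence indicators.
    Each step updates only the single n-gram with the largest |gradient|.
    """
    feats = [char_ngrams(d, max_n) for d in docs]
    weights = {}
    for _ in range(steps):
        grad = defaultdict(float)
        for f, y in zip(feats, labels):
            s = sum(weights.get(g, 0.0) for g in f)
            # d/ds log(sigmoid(y * s)) = y / (1 + exp(y * s))
            factor = y / (1.0 + math.exp(y * s))
            for g in f:
                grad[g] += factor
        best = max(grad, key=lambda g: abs(grad[g]))
        weights[best] = weights.get(best, 0.0) + lr * grad[best]
    return weights

# Toy corpus (hypothetical): positives contain "x", negatives contain "q".
docs = ["axb", "bxa", "xab", "aqb", "bqa", "qab"]
labels = [1, 1, 1, -1, -1, -1]
weights = train_greedy_logreg(docs, labels)
```

On this toy corpus the learned model is a short list of weighted n-grams, which is the kind of sparse, readable model the abstract refers to. A real implementation would replace the full enumeration in the inner loop with a bound that prunes n-grams whose extensions cannot beat the current best gradient.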

    Learning word-to-concept mappings for automatic text classification

    For both classification and retrieval of natural language text documents, the standard document representation is a term vector where a term is simply a morphological normal form of the corresponding word. A potentially better approach would be to map every word onto a concept, i.e., the proper word sense, and use this additional information in the learning process. In this paper we address the problem of automatically classifying natural language text documents. We investigate the effect of word-to-concept mappings and word sense disambiguation techniques on improving classification accuracy. We use the WordNet thesaurus as a background knowledge base and propose a generative language model approach to document classification. We show experimental results comparing the performance of our model with Naive Bayes and SVM classifiers.

    Back to Basics: A Sanity Check on Modern Time Series Classification Algorithms

    The state of the art in time series classification has come a long way, from the 1NN-DTW algorithm to the ROCKET family of classifiers. However, in the current fast-paced development of new classifiers, taking a step back and performing simple baseline checks is essential. These checks are often overlooked, as researchers are focused on establishing new state-of-the-art results, developing scalable algorithms, and making models explainable. Nevertheless, many datasets look like time series at first glance, yet classic algorithms such as tabular methods that ignore time ordering may perform better on them. For example, for spectroscopy datasets, tabular methods tend to significantly outperform recent time series methods. In this study, we compare the performance of tabular models using classic machine learning approaches (e.g., Ridge, LDA, RandomForest) with the ROCKET family of classifiers (e.g., Rocket, MiniRocket, MultiRocket). Tabular models are simple and very efficient, while the ROCKET family of classifiers is more complex and has state-of-the-art accuracy and efficiency among recent time series classifiers. We find that tabular models outperform the ROCKET family of classifiers on approximately 19% of univariate and 28% of multivariate datasets in the UCR/UEA benchmark and achieve accuracy within 10 percentage points on about 50% of datasets. Our results suggest that it is important to consider simple tabular models as baselines when developing time series classifiers. These models are very fast, can be as effective as more complex methods, and may be easier to understand and deploy.
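
As a rough illustration of what a tabular baseline means here, the sketch below treats each time point as an independent column and fits a closed-form ridge classifier. The synthetic data, the helper names, and the regularization value `alpha` are assumptions for illustration only; the study itself uses the UCR/UEA benchmark and standard library classifiers.

```python
import numpy as np

def make_series(rng, n_per_class, length=50):
    """Synthetic two-class data: class 1 has a raised segment at positions 10-20."""
    X = rng.normal(0.0, 0.5, size=(2 * n_per_class, length))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, 10:21] += 2.0
    return X, y

def fit_ridge_classifier(X, y, alpha=1.0):
    """Closed-form ridge regression on +/-1 targets; time order is ignored."""
    mean = X.mean(axis=0)
    Xc = X - mean
    t = np.where(y == 1, 1.0, -1.0)
    bias = t.mean()
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (t - bias))
    return w, mean, bias

def predict(model, X):
    """Label a series 1 when the regression output is positive, else 0."""
    w, mean, bias = model
    return ((X - mean) @ w + bias > 0).astype(int)

rng = np.random.default_rng(0)
X_train, y_train = make_series(rng, 50)
X_test, y_test = make_series(rng, 50)

model = fit_ridge_classifier(X_train, y_train)
accuracy = (predict(model, X_test) == y_test).mean()

# Shuffle the time axis the same way in train and test: a tabular model
# uses no ordering information, so it should be essentially unaffected.
perm = rng.permutation(X_train.shape[1])
model_perm = fit_ridge_classifier(X_train[:, perm], y_train)
accuracy_perm = (predict(model_perm, X_test[:, perm]) == y_test).mean()
```

The permutation check at the end is one simple way to see whether a dataset's time ordering matters: if a model that ignores ordering already does well, a dedicated time series classifier may not be needed.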

    AMEE: A Robust Framework for Explanation Evaluation in Time Series Classification

    This paper aims to provide a framework to quantitatively evaluate and rank explanation methods for the time series classification task, which deals with a prevalent data type in critical domains such as healthcare and finance. The recent surge of research interest in explanation methods for time series classification has produced a great variety of explanation techniques. Nevertheless, when these explanation techniques disagree on a specific problem, it remains unclear which of them to use. Comparing the explanations to find the right answer is non-trivial. Two key challenges remain: how to quantitatively and robustly evaluate the informativeness (i.e., relevance for the classification task) of a given explanation method, and how to compare explanation methods side by side. We propose AMEE, a Model-Agnostic Explanation Evaluation framework for quantifying and comparing multiple saliency-based explanations for time series classification. Perturbation is added to the input time series guided by the saliency maps (i.e., importance weights for each point in the time series). The impact of perturbation on classification accuracy is measured and used for explanation evaluation. The results show that perturbing discriminative parts of the time series leads to significant changes in classification accuracy. To be robust to different types of perturbations and different types of classifiers, we aggregate the accuracy loss across perturbations and classifiers. This allows us to objectively quantify and rank different explanation methods. We provide a quantitative and qualitative analysis for synthetic datasets, a variety of UCR benchmark datasets, as well as a real-world dataset with known expert ground truth.
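
The perturbation step described in this abstract can be sketched as follows. This is a simplified illustration, not the AMEE implementation: it uses a single nearest-centroid classifier, a single perturbation type (replacing points with the training mean), and one global saliency vector per explanation method, whereas the abstract describes aggregating over several classifiers and perturbations. All names and the synthetic data are assumptions.

```python
import numpy as np

def make_series(rng, n_per_class, length=50):
    """Synthetic two-class data: class 1 has a raised segment at positions 10-20."""
    X = rng.normal(0.0, 0.5, size=(2 * n_per_class, length))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, 10:21] += 2.0
    return X, y

def nearest_centroid_predict(centroids, X):
    """Assign each series to the class whose centroid is closest."""
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def accuracy_after_perturbation(X, y, centroids, saliency, fill, k):
    """Replace the k most salient time points with fill values, then re-score."""
    top = np.argsort(saliency)[-k:]
    Xp = X.copy()
    Xp[:, top] = fill[top]
    return (nearest_centroid_predict(centroids, Xp) == y).mean()

rng = np.random.default_rng(0)
X_train, y_train = make_series(rng, 50)
X_test, y_test = make_series(rng, 50)

centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
fill = X_train.mean(axis=0)  # perturbation: overwrite with the training mean
base_acc = (nearest_centroid_predict(centroids, X_test) == y_test).mean()

# Two candidate explanations: an informative one and a random one.
saliency_informed = np.abs(centroids[1] - centroids[0])
saliency_random = rng.random(X_train.shape[1])

k = 11
loss_informed = base_acc - accuracy_after_perturbation(
    X_test, y_test, centroids, saliency_informed, fill, k)
loss_random = base_acc - accuracy_after_perturbation(
    X_test, y_test, centroids, saliency_random, fill, k)
```

An explanation that points at the truly discriminative region should cause a larger accuracy loss than a random one when its top-ranked points are perturbed, which gives a basis for ranking explanation methods.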